With the data set of your choice, after ensuring the variable(s) you’re exploring are indeed factors, you are expected to:
Drop Oceania. Filter the Gapminder data to remove observations associated with the continent of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.
First, let’s load the gapminder dataset and the packages we need:
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(plotly))
Now let’s see which levels the continent factor has in the gapminder dataset, and how many there are:
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
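The task asks us to confirm that the variables are factors before manipulating their levels. A minimal sketch of that check, assuming the gapminder and forcats packages are installed (`continent_counts` is a name introduced here for illustration):

```r
library(gapminder)
library(forcats)

# Confirm that both candidate variables are stored as factors
is.factor(gapminder$continent)
is.factor(gapminder$country)

# Count the observations behind each continent level;
# fct_count() returns a tibble with columns f (level) and n (count)
continent_counts <- fct_count(gapminder$continent)
continent_counts
```

Knowing the per-level counts up front tells us how many rows to expect to lose when a level such as Oceania is filtered out.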
Let’s create a new data frame, gapminder_drop_oceania, that excludes the rows for the continent Oceania:
gapminder_drop_oceania <- gapminder %>% filter(continent != "Oceania")
levels(gapminder_drop_oceania$continent) # the rows are gone, but the unused Oceania level is still there
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Piping the filtered data into droplevels() removes the unused level as well:
gapminder_drop_oceania2 <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()
levels(gapminder_drop_oceania2$continent) #now you can see we have dropped the Oceania level
## [1] "Africa" "Americas" "Asia" "Europe"
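The task also asks for concrete numbers before and after the drop. A minimal sketch of that comparison, assuming the gapminder and dplyr packages are installed (`no_oceania` and the summary vectors are names introduced here for illustration):

```r
library(gapminder)
library(dplyr)

# Remove the Oceania rows, then drop every factor level that is no longer used
no_oceania <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

# Row counts before and after
row_counts <- c(before = nrow(gapminder), after = nrow(no_oceania))

# droplevels() acts on every factor in the data frame,
# so both continent and country can lose levels
continent_levels <- c(before = nlevels(gapminder$continent),
                      after  = nlevels(no_oceania$continent))
country_levels <- c(before = nlevels(gapminder$country),
                    after  = nlevels(no_oceania$country))

row_counts
continent_levels
country_levels
```

With the gapminder data as shipped in the package, this reports 1704 rows before and 1680 after (Oceania accounts for 24 rows). The country factor goes from 142 levels to 140, because Australia and New Zealand are no longer present.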
Reorder the levels of country or continent. Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.
Let’s filter the data down to a single continent, the Americas, and a single year, 2002:
gap_Americas_2002 <- gapminder %>%
  filter(year == 2002, continent == "Americas")
Now let’s reorder the countries by descending life expectancy and plot the result:
gap_Americas_2002 %>%
  mutate(country = fct_reorder(country, desc(lifeExp))) %>%
  ggplot(aes(lifeExp, country)) +
  geom_point(colour = "Red") +
  labs(y = "Country", x = "Life Expectancy")
It’s interesting to note that in 2002 Haiti had the lowest life expectancy in the Americas and Canada had the highest.
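The task suggests trying a summary statistic other than the median. With one observation per country the choice of summary makes no difference, so the sketch below reorders continent instead, using the minimum life expectancy in 2002. It assumes the gapminder, dplyr, forcats and ggplot2 packages are installed; `gap_2002` is a name introduced here for illustration.

```r
library(gapminder)
library(dplyr)
library(forcats)
library(ggplot2)

# Reorder continent by the minimum life expectancy observed in 2002
# and store the result, so the new level order is part of the data
gap_2002 <- gapminder %>%
  filter(year == 2002) %>%
  mutate(continent = fct_reorder(continent, lifeExp, .fun = min))

# The levels now run from the lowest to the highest continent minimum
levels(gap_2002$continent)

# Boxplots are drawn in level order, so the new ordering shows up on the x axis
ggplot(gap_2002, aes(continent, lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life Expectancy")
```

Assigning the result of mutate() matters here: when the reordering happens only inside a pipeline that ends in ggplot(), the original data frame keeps its alphabetical levels.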
Experiment with one or more of write_csv()/read_csv() (and/or TSV friends), saveRDS()/readRDS(), dput()/dget(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. I highly recommend you fiddle with the factor levels, i.e. make them non-alphabetical (see previous section). Explore whether this survives the round trip of writing to file then reading back in.
We will now write the filtered data to a CSV file in the working directory:
write_csv(gap_Americas_2002, "Americas_2002", col_names = TRUE)
We can read it back with the read_csv() function:
read_back <- read_csv("Americas_2002")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
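To confirm that the file was read back correctly, let’s print it. Note from the column specification above that country and continent came back as character columns, not factors: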
knitr::kable(read_back)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Argentina | Americas | 2002 | 74.340 | 38331121 | 8797.641 |
| Bolivia | Americas | 2002 | 63.883 | 8445134 | 3413.263 |
| Brazil | Americas | 2002 | 71.006 | 179914212 | 8131.213 |
| Canada | Americas | 2002 | 79.770 | 31902268 | 33328.965 |
| Chile | Americas | 2002 | 77.860 | 15497046 | 10778.784 |
| Colombia | Americas | 2002 | 71.682 | 41008227 | 5755.260 |
| Costa Rica | Americas | 2002 | 78.123 | 3834934 | 7723.447 |
| Cuba | Americas | 2002 | 77.158 | 11226999 | 6340.647 |
| Dominican Republic | Americas | 2002 | 70.847 | 8650322 | 4563.808 |
| Ecuador | Americas | 2002 | 74.173 | 12921234 | 5773.045 |
| El Salvador | Americas | 2002 | 70.734 | 6353681 | 5351.569 |
| Guatemala | Americas | 2002 | 68.978 | 11178650 | 4858.347 |
| Haiti | Americas | 2002 | 58.137 | 7607651 | 1270.365 |
| Honduras | Americas | 2002 | 68.565 | 6677328 | 3099.729 |
| Jamaica | Americas | 2002 | 72.047 | 2664659 | 6994.775 |
| Mexico | Americas | 2002 | 74.902 | 102479927 | 10742.441 |
| Nicaragua | Americas | 2002 | 70.836 | 5146848 | 2474.549 |
| Panama | Americas | 2002 | 74.712 | 2990875 | 7356.032 |
| Paraguay | Americas | 2002 | 70.755 | 5884491 | 3783.674 |
| Peru | Americas | 2002 | 69.906 | 26769436 | 5909.020 |
| Puerto Rico | Americas | 2002 | 77.778 | 3859606 | 18855.606 |
| Trinidad and Tobago | Americas | 2002 | 68.976 | 1101832 | 11460.600 |
| United States | Americas | 2002 | 77.310 | 287675526 | 39097.100 |
| Uruguay | Americas | 2002 | 75.307 | 3363085 | 7727.002 |
| Venezuela | Americas | 2002 | 72.766 | 24287670 | 8605.048 |
kable(read_back) %>%
  kable_styling("striped", full_width = FALSE)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Argentina | Americas | 2002 | 74.340 | 38331121 | 8797.641 |
| Bolivia | Americas | 2002 | 63.883 | 8445134 | 3413.263 |
| Brazil | Americas | 2002 | 71.006 | 179914212 | 8131.213 |
| Canada | Americas | 2002 | 79.770 | 31902268 | 33328.965 |
| Chile | Americas | 2002 | 77.860 | 15497046 | 10778.784 |
| Colombia | Americas | 2002 | 71.682 | 41008227 | 5755.260 |
| Costa Rica | Americas | 2002 | 78.123 | 3834934 | 7723.447 |
| Cuba | Americas | 2002 | 77.158 | 11226999 | 6340.647 |
| Dominican Republic | Americas | 2002 | 70.847 | 8650322 | 4563.808 |
| Ecuador | Americas | 2002 | 74.173 | 12921234 | 5773.045 |
| El Salvador | Americas | 2002 | 70.734 | 6353681 | 5351.569 |
| Guatemala | Americas | 2002 | 68.978 | 11178650 | 4858.347 |
| Haiti | Americas | 2002 | 58.137 | 7607651 | 1270.365 |
| Honduras | Americas | 2002 | 68.565 | 6677328 | 3099.729 |
| Jamaica | Americas | 2002 | 72.047 | 2664659 | 6994.775 |
| Mexico | Americas | 2002 | 74.902 | 102479927 | 10742.441 |
| Nicaragua | Americas | 2002 | 70.836 | 5146848 | 2474.549 |
| Panama | Americas | 2002 | 74.712 | 2990875 | 7356.032 |
| Paraguay | Americas | 2002 | 70.755 | 5884491 | 3783.674 |
| Peru | Americas | 2002 | 69.906 | 26769436 | 5909.020 |
| Puerto Rico | Americas | 2002 | 77.778 | 3859606 | 18855.606 |
| Trinidad and Tobago | Americas | 2002 | 68.976 | 1101832 | 11460.600 |
| United States | Americas | 2002 | 77.310 | 287675526 | 39097.100 |
| Uruguay | Americas | 2002 | 75.307 | 3363085 | 7727.002 |
| Venezuela | Americas | 2002 | 72.766 | 24287670 | 8605.048 |
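To see whether a non-alphabetical level order survives the round trip, here is a minimal sketch that compares write_csv()/read_csv() with saveRDS()/readRDS(). It assumes the gapminder, dplyr, forcats and readr packages are installed. The object names and the temporary file paths are hypothetical and used only for this illustration.

```r
library(gapminder)
library(dplyr)
library(forcats)
library(readr)

# Store the reordered factor so the non-alphabetical order is part of the data
americas_2002 <- gapminder %>%
  filter(year == 2002, continent == "Americas") %>%
  droplevels() %>%
  mutate(country = fct_reorder(country, lifeExp))

# Hypothetical temporary files, so nothing is left in the working directory
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(americas_2002, csv_path)
saveRDS(americas_2002, rds_path)

# col_types = cols() lets readr guess the types without printing the specification
from_csv <- read_csv(csv_path, col_types = cols())
from_rds <- readRDS(rds_path)

# CSV is plain text: the factor comes back as a character vector
class(from_csv$country)

# RDS stores the R object itself: the level order is preserved
identical(levels(from_rds$country), levels(americas_2002$country))
```

The CSV route keeps the values but loses the factor and its level order, so the reordering has to be redone after reading. The RDS route returns the object as it was saved.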
Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.
Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?
Here is a plot I created early in the semester:
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10() +
  geom_point(colour = "blue", alpha = 0.2)
Let’s try to revamp this:
revamp <- gapminder %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point(aes(colour = pop), alpha = 0.2) +
  scale_x_log10() +
  scale_colour_distiller(
    trans = "log10",
    breaks = 10^(5:9),  # powers of ten within the range of pop; 5^(1:5) falls below every population value
    palette = "Blues"   # "Blue" is not a ColorBrewer palette name and triggers an "Unknown palette" warning
  ) +
  theme_light() +
  labs(title = "Life Expectancy and GDP Per Capita") +
  ylab("Life Expectancy") +
  xlab("GDP Per Capita") +
  facet_wrap(~ continent) +
  scale_y_continuous(breaks = 10 * (1:10))
ggplotly(revamp)
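A plotly graph adds interactivity that a static ggplot2 image cannot offer: hovering shows the values behind a point, the view can be zoomed and panned, and clicking a legend entry hides or shows that group. The sketch below illustrates the hover tooltip, assuming the gapminder, dplyr, ggplot2 and plotly packages are installed. The names `p` and `fig` are introduced here for illustration, and the colours come from the continent_colors vector that ships with gapminder.

```r
library(gapminder)
library(dplyr)
library(ggplot2)
library(plotly)

# The text aesthetic is not drawn by ggplot2 itself;
# ggplotly() can pick it up and show it in the hover tooltip
p <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, colour = continent, text = country)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +
  scale_colour_manual(values = continent_colors) +
  labs(x = "GDP Per Capita", y = "Life Expectancy", colour = "Continent")

# Restrict the tooltip to the country name and the two plotted values
fig <- ggplotly(p, tooltip = c("text", "x", "y"))
fig
```

Hovering over a point in the resulting widget identifies the country, which the static version could only do with labels that clutter the plot.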